NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Automated Detection and Analysis of Data Practices Using A Real-World Corpus

https://doi.org/10.18653/v1/2024.findings-acl.271

Srinath, Mukund; Narayanan_Venkit, Pranav; Badillo, Maria; Schaub, Florian; Giles, C; Wilson, Shomir (January 2024, Association for Computational Linguistics)

Privacy policies are crucial for informing users about data practices, yet their length and complexity often deter users from reading them. In this paper, we propose an automated approach to identify and visualize data practices within privacy policies at different levels of detail. Leveraging crowd-sourced annotations from the ToS;DR platform, we experiment with various methods to match policy excerpts with predefined data practice descriptions. We further conduct a case study to evaluate our approach on a real-world policy, demonstrating its effectiveness in simplifying complex policies. Experiments show that our approach accurately matches data practice descriptions with policy excerpts, facilitating the presentation of simplified privacy information to users.
more » « less
Full Text Available
Privacy Lost and Found: An Investigation at Scale of Web Privacy Policy Availability

https://doi.org/10.1145/3573128.3604902

Srinath, Mukund; Sundareswara, Soundarya; Venkit, Pranav; Giles, C. Lee; Wilson, Shomir (August 2023, DocEng '23: Proceedings of the ACM Symposium on Document Engineering 2023)

Legal jurisdictions around the world require organisations to post privacy policies on their websites. However, in spite of laws such as GDPR and CCPA reinforcing this requirement, organisations sometimes do not comply, and a variety of semi-compliant failure modes exist. To investigate the landscape of web privacy policies, we crawl the privacy policies from 7 million organisation websites with the goal of identifying when policies are unavailable. We conduct a large-scale investigation of the availability of privacy policies and identify potential reasons for unavailability such as dead links, documents with empty content, documents that consist solely of placeholder text, and documents unavailable in the specific languages offered by their respective websites. We estimate the frequencies of these failure modes and the overall unavailability of privacy policies on the web and find that privacy policies URLs are only available in 34% of websites. Further, 1.37% of these URLs are broken links and 1.23% of the valid links lead to pages without a policy. Further, to enable investigation of privacy policies at scale, we use the capture-recapture technique to estimate the total number of English language privacy policies on the web and the distribution of these documents across top level domains and sectors of commerce. We estimate the lower bound on the number of English language privacy policies to be around 3 million. Finally, we release the CoLIPPs Corpus containing around 600k policies and their metadata consisting of policy URL, length, readability, sector of commerce, and policy crawl date.
more » « less
Full Text Available
Researchers’ Experiences in Analyzing Privacy Policies: Challenges and Opportunities

https://doi.org/10.56553/popets-2023-0111

Mhaidli, Abraham; Fidan, Selin; Doan, An; Herakovic, Gina; Srinath, Mukund; Matheson, Lee; Wilson, Shomir; Schaub, Florian (October 2023, Proceedings on Privacy Enhancing Technologies)

Companies' privacy policies and their contents are being analyzed for many reasons, including to assess the readability, usability, and utility of privacy policies; to extract and analyze data practices of apps and websites; to assess compliance of companies with relevant laws and their own privacy policies, and to develop tools and machine learning models to summarize and read policies. Despite the importance and interest in studying privacy policies from researchers, regulators, and privacy activists, few best practices or approaches have emerged and infrastructure and tool support is scarce or scattered. In order to provide insight into how researchers study privacy policies and the challenges they face when doing so, we conducted 26 interviews with researchers from various disciplines who have conducted research on privacy policies. We provide insights on a range of challenges around policy selection, policy retrieval, and policy content analysis, as well as multiple overarching challenges researchers experienced across the research process. Based on our findings, we discuss opportunities to better facilitate privacy policy research, including research directions for methodologically advancing privacy policy analysis, potential structural changes around privacy policies, and avenues for fostering an interdisciplinary research community and maturing the field.
more » « less
Full Text Available
Privacy Now or Never: Large-Scale Extraction and Analysis of Dates in Privacy Policy Text

https://doi.org/10.1145/3573128.3609342

Srinath, Mukund; Matheson, Lee; Venkit, Pranav Narayanan; Zanfir-Fortuna, Gabriela; Schaub, Florian; Giles, C. Lee; Wilson, Shomir (August 2023, DocEng '23: Proceedings of the ACM Symposium on Document Engineering 2023)

The General Data Protection Regulation (GDPR) and other recent privacy laws require organizations to post their privacy policies, and place specific expectations on organisations' privacy practices. Privacy policies take the form of documents written in natural language, and one of the expectations placed upon them is that they remain up to date. To investigate legal compliance with this recency requirement at a large scale, we create a novel pipeline that includes crawling, regex-based extraction, candidate date classification and date object creation to extract updated and effective dates from privacy policies written in English. We then analyze patterns in policy dates using four web crawls and find that only about 40% of privacy policies online contain a date, thereby making it difficult to assess their regulatory compliance. We also find that updates in privacy policies are temporally concentrated around passage of laws regulating digital privacy (such as the GDPR), and that more popular domains are more likely to have policy dates as well as more likely to update their policies regularly.
more » « less
Full Text Available
A large-scale exploration of terms of service documents on the web

https://doi.org/10.1145/3469096.3474940

Sundareswara, Soundarya Nurani; Srinath, Mukund; Wilson, Shomir; Giles, C. Lee (August 2021, Proceedings of the 21st ACM Symposium on Document Engineering)

Terms of service documents are a common feature of organizations' websites. Although there is no blanket requirement for organizations to provide these documents, their provision often serves essential legal purposes. Users of a website are expected to agree with the contents of a terms of service document, but users tend to ignore these documents as they are often lengthy and difficult to comprehend. As a step towards understanding the landscape of these documents at a large scale, we present a first-of-its-kind terms of service corpus containing 247,212 English language terms of service documents obtained from company websites sampled from Free Company Dataset. We examine the URLs and contents of the documents and find that some websites that purport to post terms of service actually do not provide them. We analyze reasons for unavailability and determine the overall availability of terms of service in a given set of website domains. We also identify that some websites provide an agreement that combines terms of service with a privacy policy, which is often an obligatory separate document. Using topic modeling, we analyze the themes in these combined documents by comparing them with themes found in separate terms of service and privacy policies. Results suggest that such single-page agreements miss some of the most prevalent topics available in typical privacy policies and terms of service documents and that many disproportionately cover privacy policy topics as compared to terms of service topics.
more » « less
Full Text Available
Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies

https://doi.org/10.18653/v1/2021.acl-long.532

Srinath, Mukund; Wilson, Shomir; Giles, C Lee (January 2021, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing)

Organisations disclose their privacy practices by posting privacy policies on their websites. Even though internet users often care about their digital privacy, they usually do not read privacy policies, since understanding them requires a significant investment of time and effort. Natural language processing has been used to create experimental tools to interpret privacy policies, but there has been a lack of large privacy policy corpora to facilitate the creation of large-scale semi-supervised and unsupervised models to interpret and simplify privacy policies. Thus, we present the PrivaSeer Corpus of 1,005,380 English language website privacy policies collected from the web. The number of unique websites represented in PrivaSeer is about ten times larger than the next largest public collection of web privacy policies, and it surpasses the aggregate of unique websites represented in all other publicly available privacy policy corpora combined. We describe a corpus creation pipeline with stages that include a web crawler, language detection, document classification, duplicate and near-duplicate removal, and content extraction. We employ an unsupervised topic modelling approach to investigate the contents of policy documents in the corpus and discuss the distribution of topics in privacy policies at web scale. We further investigate the relationship between privacy policy domain PageRanks and text features of the privacy policies. Finally, we use the corpus to pretrain PrivBERT, a transformer-based privacy policy language model, and obtain state of the art results on the data practice classification and question answering tasks.
more » « less
Full Text Available

Search for: All records